Frontiers in Bioinformatics

Frontiers Media SA

Preprints posted in the last 7 days, ranked by how well they match Frontiers in Bioinformatics's content profile, based on 45 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit.

1
A priority index-based computational medicine framework (PimRNA) for prioritising personalised mRNA cancer vaccines

Fang, H.; Tan, T.

2026-05-29 oncology 10.64898/2026.05.26.26354114 medRxiv
Top 0.2%
1.7%

Background: The development of personalised mRNA cancer vaccines holds considerable promise for oncology, yet a significant translational gap persists between neoantigen identification and the selection of therapeutically impactful targets. Current approaches predominantly prioritise human leukocyte antigen (HLA) binding affinity and immunogenicity, often overlooking the systems-level biological context of the target. This can inadvertently favour immunogenic but biologically peripheral peptides that exert limited influence on tumour signalling networks, thereby constraining vaccine efficacy. Furthermore, mRNA therapeutics must satisfy additional design requirements, including favourable codon usage and favourable secondary-structure stability, which directly affect in vivo translation and half-life. A unified computational framework that integrates neoantigen discovery with network biology is therefore critically needed. Results: Here, we present PimRNA, a Priority index (Pi)-centric computational medicine framework that bridges this gap by unifying neoantigen identification, mRNA sequence optimisation, and gene interaction network analysis. First, high-confidence tumour-specific HLA class I and II neoantigenic peptides are identified from paired tumour-normal genomic and tumour transcriptomic data using NeoDisc. Second, the coding sequences of these peptides are optimised for stability and translational efficiency with LinearDesign, yielding a core set of neoantigen-encoding mRNAs. Third, a random walk with restart algorithm is applied to a knowledgebase of gene interactions to identify peripheral genes exhibiting significant network connectivity to core genes, generating a gene-predictor matrix in which each gene is assigned an affinity score reflecting its network proximity to immunogenic neoantigens. 
These scores are consolidated into a single, unified priority rating (0-5) for each gene, followed by subnetwork analysis that reveals therapeutically relevant gene modules. Application of PimRNA to breast cancer and melanoma datasets demonstrates that it successfully selects high-confidence immunogenic neoantigen candidates embedded within biologically meaningful tumour-specific networks. Conclusion: PimRNA provides a systems biology foundation for mRNA vaccine design, moving beyond isolated immunogenicity to prioritise targets that are both highly presented and central to tumour-relevant biological networks. This framework offers a generalisable strategy for the rational discovery and prioritisation of mRNA therapeutics, significantly advancing the field of computational medicine towards personalised cancer vaccines.
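The random walk with restart step described in this abstract can be sketched on a toy graph. This is a minimal illustration only, not PimRNA's implementation: the gene names `G1` to `G4`, the path-shaped graph, and the restart probability of 0.5 are all assumptions made for the example.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities for every node.

    adjacency: dict mapping node -> list of neighbours (undirected toy graph,
               every node is assumed to have at least one neighbour)
    seeds:     nodes the walker restarts from (stand-ins for "core" genes)
    """
    nodes = sorted(adjacency)
    # Restart distribution: uniform over the seed nodes, zero elsewhere.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to a seed.
        nxt = {n: restart * p0[n] for n in nodes}
        # Otherwise it moves to a uniformly chosen neighbour.
        for n in nodes:
            share = (1.0 - restart) * p[n] / len(adjacency[n])
            for m in adjacency[n]:
                nxt[m] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical path graph G1 - G2 - G3 - G4 with G1 as the only core gene.
toy_graph = {"G1": ["G2"], "G2": ["G1", "G3"], "G3": ["G2", "G4"], "G4": ["G3"]}
scores = random_walk_with_restart(toy_graph, seeds={"G1"})
for gene in sorted(scores):
    print(gene, round(scores[gene], 4))
```

In this toy setting the affinity score falls off with distance from the seed, which is the sense in which such a score reflects network proximity to the core genes.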

2
Impact of AI-Assisted Mammography Reading on Quality Indicators in the Czech Breast Cancer Screening Programme: A Retrospective Study

Veverkova, L.; Dolezalova, Z.; Marackova, V.; Mathew, E.; Urbankova, M.; Ambrozova, M.; Piskovsky, T.; Ngo, O.; Majek, O.

2026-05-26 oncology 10.64898/2026.05.25.26353869 medRxiv
Top 0.7%
0.8%

Objectives: The aim of mammographic screening is the early detection of invasive cancers. In the era of artificial intelligence (AI), this tool may improve diagnosis of earlier stages. The purpose of this study was to assess the impact on selected quality indicators retrospectively. Method: The data source was the Breast Cancer Screening Registry using data from one Screening Unit that currently uses AI routinely. The indicators of the cancer detection rate (CDR), further assessment rate (FAR), and recall rate (RR) in the year 2023, when AI was used, and the year 2022, without AI, in women aged 45-69 were compared. The statistical evaluation used the chi-square test and logistic regression adjusting for the effects of age, a woman's risk level, and the screening round at a 5% significance level. Results: In 2022, without AI, 4,034 women aged 45-69 were included, compared with 4,049 women in 2023 when AI was used. This study showed a non-significant increase in CDR from 5.0 breast cancers detected per 1,000 women (non-AI assessment) to 5.2 (AI-assisted assessment), p = 0.919; OR (95% CI): 1.034 (0.542-1.974), a significant decrease in the FAR from 5.2% to 3.9%, p < 0.001; OR (95% CI): 0.665 (0.529-0.836), and a decrease in RR from 2.4% to 1.9%, p = 0.083; OR (95% CI): 0.754 (0.548-1.037). Conclusion: AI has the potential to be a useful tool in the early detection of breast cancer by improving quality through a decrease in FAR and RR, while probably maintaining CDR.

3
Beyond Identifier Matching: An Empirical Characterization of Failure Modes in Biomedical Knowledge Graph Integration

Hu, S.; Cheng, H.; Gillenwater, L.; Manpearl, K.; Mandava, A.; Wang, Y.; Pividori, M.; Stranger, B.; Krishnan, A.; Greene, C.; Gao, Y.

2026-05-28 health informatics 10.64898/2026.05.26.26354182 medRxiv
Top 0.8%
0.8%

Objective. Biomedical knowledge graphs (KGs) such as PrimeKG, Hetionet, UMLS, and PharmGKB are increasingly used as the substrate for downstream machine-learning, retrieval-augmented generation, drug-repurposing, and electronic health record (EHR) augmentation pipelines. The dominant assumption in published work is that integrating two or more such KGs is a tractable engineering step solved by identifier (ID) matching. This paper interrogates that assumption empirically. We quantify how much concept overlap survives realistic alignment, and we characterize the new failure modes introduced by the methods that practitioners reach for when ID matching is insufficient. Materials and Methods. We compared four widely used biomedical KGs (PrimeKG, Hetionet v1.0, the full UMLS Metathesaurus, and PharmGKB) across eleven node types using a tiered alignment pipeline: (1) direct ID matching for nodes sharing a primary vocabulary; (2) cross-ontology bridging using standard mappings (e.g., MONDO-DOID, HPO-UMLS, HPO-UMLS-MeSH for side effects, NCBI Gene-HGNC-UMLS, UBERON-FMA/SNOMEDCT_US/NCI/MeSH for anatomy); (3) ClinicalBERT cosine-similarity grouping at threshold >= 0.98 for over-segmented disease nodes, with a deterministic suffix-stripping canonicalizer; (4) exact name matching for ontology-poor types (anatomy, REACTOME pathways); and (5) embedding-based fuzzy matching with UMLS lookup (SapBERT and ClinicalBERT) for free-text microbiome concepts. We applied the pipeline to a 698-concept gut-microbiome benchmark spanning taxa, pathways, and disease labels, validated grouping decisions against the curated SSSOM mappings released by the MONDO project, and audited the ClinicalBERT consolidation against five clinical-genetics case studies drawn from the literature. Results. Per-type pairwise coverage was strikingly asymmetric. 
Genes/proteins and the three Gene Ontology categories aligned cleanly across PrimeKG and Hetionet (mutual coverage 94-99%), but disease overlap was sparse: only 0.7% of PrimeKG individual disease nodes mapped to Hetionet, rising to 2.0% after MONDO grouping (versus 78.7% and 18.4% from the Hetionet side). PrimeKG-to-UMLS coverage spanned 100% (effect/phenotype via HPO) down to 20.8% (REACTOME pathways), with drugs at 73.7% and anatomy at 58.8%. PrimeKG-to-PharmGKB drug coverage required up to two bridging hops (DrugBank -> UMLS -> RxNorm/ATC/MeSH). Bigger was not uniformly more complete: on a 698-concept microbiome drug benchmark, Hetionet missed 0 concepts while PrimeKG missed 16. ClinicalBERT-based grouping consolidated 22,205 raw MONDO disease nodes into 17,080 groups but introduced three reproducible failure modes documented in case studies: (i) peer over-merging: for example, all 22 osteogenesis imperfecta subtypes collapsed into a single node despite distinct severity classes; (ii) parent-child collapse: e.g. acute myeloid leukemia merged with myeloid leukemia, erasing the acute/chronic distinction that drives clinical management; and (iii) lexical false positives: neurofibromatosis and schwannomatosis grouped together despite cellular-pathology differences. Discussion. Identifier matching alone is a weak baseline for biomedical KG integration. Cross-ontology bridges and embedding-based consolidation expand coverage but do so at the cost of clinically meaningful resolution, and the resulting failures are systematic rather than random. Reporting only aggregate coverage statistics obscures these losses, which propagate silently into downstream tasks. Conclusion. We provide reusable per-type coverage tables, a taxonomy of three integration failure modes, and concrete recommendations for downstream studies that depend on a unified biomedical KG. 
We argue that future KG integration work should report per-type coverage and per-cluster confidence rather than aggregate match rates.
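To make the thresholded cosine-similarity grouping in this abstract concrete, here is a minimal sketch. The three-dimensional vectors are invented stand-ins, not ClinicalBERT embeddings, and merging every pair above the threshold with union-find (single linkage) is an assumption; the abstract does not state how groups are formed from pairwise similarities.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def group_by_similarity(embeddings, threshold=0.98):
    """Merge names whose embeddings have cosine similarity >= threshold."""
    names = list(embeddings)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical embeddings: two near-identical subtypes and one unrelated disease.
toy_embeddings = {
    "disease A type 1": (1.0, 0.0, 0.1),
    "disease A type 2": (1.0, 0.05, 0.1),
    "disease B": (0.0, 1.0, 0.0),
}
groups = group_by_similarity(toy_embeddings)
print(groups)
```

The two subtypes end up in one group because their vectors are nearly parallel, which mirrors the peer over-merging failure mode: a high similarity threshold cannot tell clinically distinct subtypes apart when their names embed almost identically.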

4
Development and Validation of a Machine Learning Model to Predict Prognosis in Patients with Advanced Head and Neck Cancer

Zhang, K.; Gao, L.; John, D.; Li, W. T.; Hogarth, M.; Coffey, C. S.; Ongkeko, W. M.

2026-05-28 oncology 10.64898/2026.05.27.26354194 medRxiv
Top 2%
0.5%

Importance Prognostic tools beyond staging are needed to guide treatment and counseling in head and neck squamous cell carcinoma (HNSCC). Objective To develop and externally validate a machine learning model predicting survival in advanced HNSCC using routinely collected clinical and biomarker data. Design, Setting, and Participants Retrospective, multi-institutional cohort study including 2,385 patients with stage III-IV HNSCC diagnosed from 2012-2022 in the University of California Health Data Warehouse (UCHDW). Patients were randomly split into training (n = 1,908) and test (n = 477) sets. Partial external validation used 7,749 patients from the Surveillance, Epidemiology, and End Results (SEER) registry (2010-2020). Exposures Demographic, tumor, treatment, comorbidity, and biomarker variables recorded at or before diagnosis. Main Outcomes and Measures The primary outcome was all-cause mortality within 70 months. Cox proportional hazards models included all predictors. Discrimination was assessed with Harrell's concordance index (C-index), calibration with predicted vs observed survival, and stratification with Kaplan-Meier curves. A Random Survival Forest (RSF) was trained for benchmarking and interpretability using Shapley Additive exPlanations (SHAP). Results Among 2,385 patients in UCHDW (median age, 63 years; 29.0% mortality), the Cox model achieved a C-index of 0.735 in the internal test set. Risk quartiles showed clear separation on Kaplan-Meier curves (log-rank p < 0.0001). In the SEER cohort (n = 7,749), where only demographic, staging, subsite, and treatment variables were available, the reduced Cox model achieved a C-index of 0.688, with calibration showing modest underestimation of survival in high-risk groups. Age, T stage, Charlson Comorbidity Index, neutrophil-to-lymphocyte ratio, and platelet count were among the strongest predictors, while surgery was associated with improved survival. 
The RSF achieved a C-index of 0.758 internally, with SHAP highlighting nonlinear effects of albumin, BMI, and inflammatory markers. Conclusions and Relevance A machine learning model using routine clinical and biomarker data demonstrated good prognostic performance in advanced HNSCC, with partial external validation. Such approaches may support individualized survival estimates, risk stratification, and treatment discussions, but broader validation is required before clinical adoption.
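For readers unfamiliar with Harrell's concordance index, the discrimination measure reported in this abstract, the following sketch computes it by brute force over pairs. The four patients are invented, and the handling of tied event times (such pairs are simply not counted) is a simplification; this is not the study's evaluation code.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable pairs in which the higher-risk subject fails first.

    times:       follow-up time per subject
    events:      1 if death was observed, 0 if the subject was censored
    risk_scores: model output, higher meaning higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event
            # strictly before subject j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical cohort: the third patient is censored at 15 months.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.6, 0.1]
c_index = harrell_c_index(times, events, risk)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the reported 0.688 to 0.758 values should be read.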

5
The Verification Gap: Artificial Intelligence Adoption, Hallucination Awareness, and Verification Practices Among Early Career Medical Researchers in Pakistan

Sajjad, M.

2026-05-30 health informatics 10.64898/2026.05.28.26354373 medRxiv
Top 2%
0.4%

Artificial intelligence (AI) tools have been rapidly adopted by medical researchers, yet whether early-career researchers in low- and middle-income countries possess the awareness and habits needed to use these tools safely remains poorly documented. This study characterized AI adoption patterns, hallucination awareness, and verification and disclosure practices among early-career medical researchers in Pakistan. A cross-sectional anonymous online survey was conducted among medical students, house officers, residents, physicians, and faculty involved in research or academic work across Pakistan (May 2026). Descriptive statistics and chi-square tests were applied to 373 eligible responses. AI use was near universal (99.7%), with 60.3% using AI tools daily. The most commonly reported tool in this sample was Claude (40.5%), followed by ChatGPT (29.2%) and Perplexity (26.0%), though this ranking likely reflects sampling characteristics. Despite high adoption, 59.2% typically did not verify AI outputs before use, and 40.2% had never heard that AI can generate fabricated scientific references. In behavioral vignettes, 36.5% assumed convincing AI-generated references were authentic, and 54.2% would continue using remaining AI content after discovering one fabricated reference. Formal research training was strongly associated with consistent disclosure (51.7% vs. 17.1%; chi-square = 48.43, p < 0.001). Role, daily use frequency, and research training were not significantly associated with verification behavior. Early-career medical researchers in Pakistan demonstrate high AI adoption alongside incomplete hallucination awareness and infrequent verification, a pattern that may carry implications for research integrity. Formal training was the only factor significantly associated with consistent disclosure. Integration of AI literacy into medical curricula and institutional governance frameworks merits consideration.

6
Connecting Baseline Immune Exhaustion in Hot Tumors to Oral Cancer Recurrence and Nodal Metastasis

Shaikh, S.; Basu, S.; Hajihosseini, M.; Nandy, S. K.; Moorthy, M.; Arun, I.; Lali, B. S.; Arun, P.; Mukherjee, G.; Pyne, S.

2026-05-30 oncology 10.64898/2026.05.27.26354295 medRxiv
Top 2%
0.4%

Background: The use of immune checkpoint inhibitors (ICIs) in the treatment of cancer has rapidly expanded over the last decade. However, there are several knowledge gaps in understanding how tumor cells evade the immune system. There is a paucity of data in HPV-negative oral cancer, particularly of the gingivobuccal region. Understanding the mechanism of immune system evasion in this cancer is vital for improving patient outcomes. Methods: We characterized the baseline immune milieu of oral cancer using immunohistochemistry (IHC) on whole tumor sections from 124 cases. Tumors were classified as hot or cold and further stratified into high-risk and low-risk groups. High-risk patients included those with lymph node metastasis at diagnosis/recurrence or distant metastasis within 2 years of treatment completion. Patients without these features were categorized as low risk. Validation by RNA-Seq and Joint Enrichment Analysis of Oncogenic and Immunologic Pathways was carried out in a subset of 46 cases. Results: Hot high-risk tumors (by IHC) were distinguished by elevated PD-L1 expression and reduced NK-cell, PD1, and CTLA-4 expression. There was no difference in the expression levels of CD3+, CD8+, granzyme, or perforin compared to hot low-risk tumors, findings that align with the definition of hot tumors. RNA-Seq revealed a gene signature associated with exhausted T-cells in hot high-risk tumors. Gene and pathway analyses identified differential upregulation of isoform-specific TOX, TCF, CXCR, RUNX, IRF, BRD and BCL6 genes, implicating immune cell exhaustion and tumor aggressiveness. Significantly downregulated genes included PDCD1, HAVCR2, ZAP70, and STAT, indicative of a disabled immune microenvironment. These findings support that a state of immune exhaustion in hot high-risk (HHR) tumors is driven by progenitor exhausted T-cells and terminally exhausted T-cells, independent of PD1-TIM3. 
Conclusion: These findings suggest that combining TOX/TCF/BCL6 inhibitors with immune checkpoint inhibitors in the adjuvant setting might benefit patients with hot high-risk tumors. Given the results, testing for a targeted exhaustion-related gene panel at diagnosis is recommended for oral cancers to stratify tumors as high-risk or low-risk. Larger validation studies and clinical trials are now warranted.

7
Using artificial intelligence for radiotherapy clinical trial quality assurance: analysis of a multi-institutional clinical trial for neurovascular-sparing prostate stereotactic ablative radiotherapy

Doucette, M.; Zhang, Y.; Liao, C.-Y.; Lin, M.-H.; Yan, Y.; Dess, R. T.; Tendulkar, R. D.; Garant, A.; Hannan, R.; Jiang, S.; Nguyen, D.; Desai, N.; Yang, D. X.

2026-05-29 health informatics 10.64898/2026.05.27.26354252 medRxiv
Top 2%
0.3%

Our study evaluated whether a deep learning auto-segmentation model combined with machine learning triage can streamline radiotherapy clinical trial quality assurance (QA). We analyzed 107 stereotactic ablative radiotherapy (SABR) cases from a multi-institutional phase II clinical trial of neurovascular-sparing prostate SABR, focusing on physician contours of the internal pudendal artery (IPA) as a novel organ-at-risk with substantial interobserver variability. Contours were scored by the trial principal investigator as Per-Protocol or Minor Deviation/Unacceptable. We applied a deep learning model for IPA auto-segmentation. Agreement between human and AI contours was then quantified using 14 overlap, distance, and surface metrics, and a supervised classifier was trained on these metrics to flag clinical trial protocol deviations. AI segmentation achieved only modest geometric accuracy, with a mean Dice similarity coefficient of 0.446 and a 95th percentile Hausdorff distance of 14.23. Nevertheless, when all 14 metrics were incorporated, a machine learning classifier yielded an AUROC of 0.836 and flagged all Minor Deviation/Unacceptable cases on the 27-case hold-out set (100% sensitivity), with 6 false positives and no false negatives. AI segmentation combined with metrics-based machine learning can triage protocol deviations within a multi-institutional radiotherapy clinical trial, supporting prospective evaluation of AI-assisted trial QA.
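The Dice similarity coefficient quoted in this abstract is one of the overlap metrics used to compare a human contour with an AI contour. A minimal sketch follows, assuming each contour is represented as a set of voxel coordinates; the two small 2D masks are invented for illustration and are not trial data.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice similarity: 2 * |A intersect B| / (|A| + |B|), between 0 and 1."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # convention used here: two empty masks agree perfectly
    return 2.0 * len(a & b) / (len(a) + len(b))


# Hypothetical 2x2 contours that share a single voxel at (1, 1).
human_contour = {(0, 0), (0, 1), (1, 0), (1, 1)}
ai_contour = {(1, 1), (1, 2), (2, 1), (2, 2)}
dice = dice_coefficient(human_contour, ai_contour)
print(dice)
```

For thin tubular structures such as an artery, a small positional offset removes most of the overlap, so a low Dice value does not by itself mean the two contours are far apart; this is one reason distance and surface metrics are usually reported alongside it.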

8
Translational bioinformatics and machine learning framework for biomarker discovery, disease prediction, and patient profiling for precision medicine

Ahmed, Z.; Govindareddy, P.; DeGroat, W.; Narayanan, R.; Peker, E.; Zeeshan, S.

2026-05-27 genetic and genomic medicine 10.64898/2026.05.23.26353961 medRxiv
Top 3%
0.3%

Precision medicine aims to move healthcare from a "one-size-fits-all" approach to personalized and predictive care across diverse populations. It promotes integration of multi-omics and phenotypic data to understand disease mechanisms and discover novel biomarkers and risk factors, which could be used to predict and prevent critical diseases in individual patients across diverse populations. A precision medicine approach can accelerate our ability to classify patients at higher risk of developing critical diseases, improve diagnostic capabilities, develop deeper understanding of individual risk, investigate racial differences and demographic characteristics, and find relationships between genetic variants, expressions, and diseases. This study focuses on implementing an innovative and data-driven framework of translational bioinformatics and Machine Learning (ML) techniques to analyze multi-omics data, including RNA-seq and Whole-Genome Sequencing (WGS) data, generated using blood samples of randomly consented patients. First, we utilized bioinformatics pipelines to identify differentially expressed genes and their pathogenic and likely pathogenic variants for the downstream data analysis, annotation, and visualization. Then, we applied a nexus of ML models for multi-omics biomarker discovery, disease prediction, density-based clustering, single-patient profiling, and pathogenicity classification. WGS data analysis supported the exploration of genetic variation and diversity among patients to identify known and novel biomarkers, whereas RNA-seq data analysis improved our understanding of functional and biological pathways underlying disease states. We classified and clustered pathogenic variants and expressions across various genes and discovered numerous leading risk factors for disease. 
Our results include gene-disease associations and captured common pathways across the broader population, demonstrating a level of sensitivity and accuracy that has broad clinical implications. We validated our results through clinical records, and state of the science literature. This study delves into the strengths of multi-omics data integration and capabilities of ML application in genetically diverse and complex patient cohorts. Our approach has the potential to elucidate complex gene-disease interactions for genetically diverse populations, which can support earlier diagnoses for patients in many disease realms.

9
Future Pandemics: AI-Designed Diagnostic Assays for Detection of Andes Orthohantavirus (ANDV) Associated with the 2026 MV Hondius Outbreak

MacSharry, J.; Tonda, A.; Lopez-Rincon, A.

2026-05-27 health informatics 10.64898/2026.05.26.26354101 medRxiv
Top 3%
0.3%

Andes orthohantavirus (ANDV), the primary etiological agent of hantavirus pulmonary syndrome (HPS) in South America, is uniquely capable of limited human-to-human transmission, posing a significant challenge for outbreak control. Recent events, including the 2018-2019 Epuyen outbreak and the 2026 MV Hondius incident, underscore the need for rapid, lineage-specific molecular diagnostics. In this study, we present an artificial intelligence (AI)-driven framework for the design of diagnostic primers targeting the S genomic segment of the Epuyen lineage. Using an evolutionary algorithm integrated with thermodynamic evaluation via Primer3Plus, candidate primers were optimized to maximize classification accuracy while satisfying stringent biochemical constraints. The resulting primer set enables amplification of lineage-specific regions suitable for molecular characterization and surveillance. In silico validation demonstrates that the proposed primers achieve perfect discrimination between 2026 outbreak sequences and other ANDV variants. Furthermore, in silico comparison with standard protocol-based primers reveals substantially reduced sensitivity and specificity in the latter, highlighting the limitations of static diagnostic designs when applied to evolving viral populations. Overall, this work demonstrates that AI-assisted primer design provides a robust and adaptable strategy to improve viral detection, enhance outbreak tracking, and support timely public health interventions. Integrating computational optimization into diagnostic development is essential for strengthening preparedness against emerging zoonotic threats.

10
A Consensus-Driven Stacking Ensemble Framework for Interpretable Cardiovascular Risk Prediction and Clinical Deployment

Sozol, S. S.; Dev Nath, B. C.; Fahim, F. M. S.; Suzana, N. N.; Mirza, J. F.; Ahmmed, S.; Zohra, F.-T.; Zafr, A. H. A.; Uddin, M. N.; Mondal, M. R. H.; Hoque, A. S. M. L.

2026-05-26 health informatics 10.64898/2026.05.18.26352989 medRxiv
Top 3%
0.3%

Machine learning (ML) is being considered to help diagnose cardiovascular diseases (CVD). Still, challenges like inconsistent and limited datasets, limited infrastructure, and global inequalities lead to the need for a reliable and practicable ML solution. This paper presents an ML-driven framework for predicting CVD risk scores and classifying status. Several data preprocessing techniques, including multiple imputation by chained equations (MICE) and outlier removal, are applied. In addition, hyperparameter tuning is performed with GridSearchCV. Moreover, a consensus-driven five-feature selection method is applied to identify optimal predictors. The dataset used in this study contains healthcare records related to future CVD risk scores, comprising 1,529 patient records with 22 features. The optimized stacked ensemble model is applied to the dataset and achieves a cross-validated coefficient of determination value of 98.13% for CVD risk score regression. Comparative evaluation with other ML models confirmed improved accuracy, efficiency, and interpretability. The explainable AI technique SHAP is applied to interpret predictions and highlight key risk factors. Moreover, a deployment-ready web platform with multi-role access has been developed that demonstrates clinical applicability. The proposed framework offers a reliable and interpretable tool for early detection of CVD and personalized risk assessment. In the future, this work can be extended to integrate longitudinal data, medical imaging, and deep learning to improve generalizability and strengthen real-world impact.

11
Dried blood spot proteomics as a diagnostic framework for citrin deficiency

Totsune, E.; Nakajima, D.; Konno, R.; Mikami-Saito, Y.; Arai-Ichinoi, N.; Nishida, H.; Yagi, H.; Ishige, T.; Suzuki, H.; Shirota, M.; Takayama, J.; Takano-Asai, C.; Shimura, M.; Sasai, H.; Lee, T.; Kido, J.; Nakajima, Y.; Kobayashi, H.; Kikuchi, A.; Numakura, C.; Hamazaki, T.; Oishi, K.; Nakamura, K.; Kawashima, Y.; Ohara, O.; Wada, Y.

2026-05-28 genetic and genomic medicine 10.64898/2026.05.26.26354012 medRxiv
Top 4%
0.2%

Background: Citrin deficiency, caused by biallelic pathogenic variants in SLC25A13, must be identified early to prevent serious complications such as hyperammonemia and liver failure. However, clinical diagnosis is often delayed due to its nonspecific presentation and the limited sensitivity of amino acid-based newborn screening methods. Although genome-based evaluations are being investigated to address these issues, concerns about their cost, turnaround time, variant interpretation ability, and data handling highlight the need for a more practical yet reliable alternative. We investigated the feasibility of applying a proteomic approach to dried blood spots (DBS), which are routinely used in newborn screening. Methods: We performed untargeted liquid chromatography-tandem mass spectrometry to analyze the proteome of DBS using a previously developed "non-targeted analysis of non-specifically DBS-absorbed proteins" (NANDA) workflow. SLC25A13 protein abundance was quantified in individuals with biallelic loss-of-function mutations, compound loss-of-function/missense mutations, and heterozygous carriers; this was also evaluated in healthy and diseased controls representing relevant differential diagnoses. To leverage proteomic information, we derived a multivariate proteomic signature using feature selection and evaluated its performance with leave-one-out cross-validation. Biological relevance was assessed by enrichment analysis, and complementary transcriptomics was performed using RNA sequencing. Results: A total of 7,474 proteins, including SLC25A13, were consistently detected in DBS. SLC25A13 was undetectable in individuals with biallelic loss-of-function mutations. However, individuals with compound loss-of-function/missense genotypes showed reduced but measurable SLC25A13 levels, comparable to those observed in heterozygous carriers. 
In contrast, a compact 15-protein signature accurately identified individuals with compound loss-of-function/missense genotypes (AUC, 0.99; sensitivity, 1.00; specificity, 0.95). The signature was enriched for Ca2+-response, and transcriptomics showed downregulation of genes related to multimodal ion channels in affected individuals compared to controls. Conclusions: DBS-based proteomic profiling may assist in the diagnosis of citrin deficiency through SLC25A13-quantification and a biologically plausible multivariate signature. More broadly, this strategy offers a promising new diagnostic layer for protein disorders, providing a proteomic readout in a clinically practical DBS format with potential utility for future diagnostic and screening applications.

12
Immune Checkpoint Response Profiles and Resistance Mechanisms in NSCLC Revealed by Circulating Extracellular Vesicle Proteomics

Taylor, C.; Davey, M.; Allain, E. P.; Cheema, A. S.; Crapoulet, N.; Finn, N.; Abd, M.; Ouellette, R.

2026-05-26 oncology 10.64898/2026.05.25.26354042 medRxiv
Top 4%
0.2%

Background: Immune-oncology has revolutionized cancer treatment, but some patients fail to benefit due to primary resistance and tumour-immune evasion. Extracellular vesicles (EVs) are secreted by both tumour and immune cells and mediate communication between cancer cells and the immune system. Our study used proteomic profiling of circulating EVs collected from NSCLC patients treated with immune checkpoint inhibitors (ICI) to identify predictive biomarkers of response as well as immune evasion mechanisms related to treatment resistance. Methods: EVs were isolated from plasma collected prior to ICI treatment using peptide-affinity purification, and high-throughput proteomics was performed using the Proximity Extension Assay. Differentially expressed EV proteins between durable (DR) and non-durable responders (NDR) were identified and evaluated using Cox proportional hazards regression, survival analysis, sex-stratified analysis, as well as pathway and network analysis. Results: Proteomics analysis identified 116 differentially expressed EV proteins between DR and NDR. NDR was characterized by enrichment of inflammatory, angiogenic, and immune-suppressive EV proteins, such as IL1RL1, TFRC, IL6ST, galectins, TNF superfamily death receptors, chemokines, and PCSK9. Pathway analysis revealed enrichment of angiogenesis, chemotaxis, ECM remodeling, and neutrophil degranulation associated with poor progression-free survival (PFS). In contrast, DR to ICI treatment was associated with EV proteins related to T- and B-cell activation and adaptive immunity. Sex-related differences in abundance and association with PFS were observed for certain EV proteins, including IL1RL1 and TFRC. A six-protein EV model (IL1RL1, TFRC, ERI1, CCN5, IGFBPL1, and TNFRSF13C) demonstrated good prognostic performance for identifying NDR (AUC = 0.907) and stratified patients into three discrete risk groups. 
Conclusions: High-plex EV proteomics revealed biologically coherent tumour-immune signaling programs that are associated with ICI treatment resistance. Profiling circulating EVs may improve our understanding of EV-mediated immune evasion mechanisms and identify protein signatures that reflect the tumour immune microenvironment and predict response to immune checkpoint blockade.

13
A Retrospective Evaluation of the Microsoft Healthcare Agent Orchestrator for Tumor Board Patient Summaries

Roy, J.; Korleski, J. B.; Augustin, R. C.; Yefet, L.; Jensen, Z. D.; Ehman, E. C.; Zadeh, G.; Conners, A. L.; Tevaarwerk, A. J.; Korfiatis, P.

2026-06-01 health informatics 10.64898/2026.05.22.26353812 medRxiv
Top 4%
0.2%

Background: Preparing tumor board patient summaries is time-intensive. Large language model-based systems may automate summarization but require real-world evaluation prior to clinical use. We performed an exploratory retrospective evaluation of the Microsoft Healthcare Agent Orchestrator (HAO), deployed in a Mayo Clinic controlled staged environment, to generate tumor board-style patient summaries from retrospective Electronic Health Record (EHR) notes. Methods: HAO generated summaries for breast, hepatobiliary, and neuro-oncology tumor board cases using up to the most recent 1,000 clinical notes. Clinician reviewers evaluated outputs via REDCap surveys across perceived factuality, completeness, clarity/conciseness, temporal cohesion, comparative performance, safety, and clinical utility (0-4 Likert scale). Reviewers were permitted to query the HAO chat interface to address missing details. Automated factuality was assessed using TBFact (bidirectional entailment), reporting precision and recall against available reference summaries. Results: Among 57 survey responses from 5 different physicians, mean scores exceeded 2.8 across domains, with medians of 3 for most axes. In an exploratory comparison, oncology fellows required less time to review HAO-generated summaries than to manually generate patient summaries (mean difference 13.57 minutes per patient, p<0.001), although this difference may be influenced by prior familiarity with the same cases; 96% of survey responses indicated that HAO would save time. TBFact evaluations showed higher recall than precision across domains, consistent with broad capture of reference content alongside additional content that was not present in gold-standard summaries. Attribution was viewed favorably but showed issues with primary-source specificity and link reliability. Conclusions: In a controlled Mayo environment, HAO demonstrated moderate performance and was associated with reduced review time for tumor board preparation. These findings are promising but preliminary and do not establish clinical safety, noninferiority to manual review, or readiness for routine clinical use. Limitations, including verbosity, specialty-specific content gaps, and inconsistent attribution, highlight the need for iterative refinement and further evaluation.

14
Deep Learning Spatial Profiling of CD103+CD8+ T Cells and Survival in Rectal Cancer After Neoadjuvant Chemoradiotherapy

Abe, T.; Yamashita, K.; Nagasaka, T.; Fujita, M.; Ueda, Y.; Miyake, S.; Ito, R.; Adachi, Y.; Ando, M.; Tsuneki, T.; Okazoe, Y.; Konaka, R.; Takahashi, T.; Kagiyama, H.; Tachibana, T.; Imai, M.; Yoshida, T.; Saito, M.; Mukohyama, J.; Kanayama, K.; Koma, Y.-I.; Otowa, Y.; Hasegawa, H.; Ikeda, T.; Koterazawa, Y.; Aoki, T.; Harada, H.; Urakawa, N.; Goto, H.; Kanaji, S.; Yanagimoto, H.; Matsuda, T.; Takamura, S.; Yamashita, T.; Sasaki, R.; Fukumoto, T.; Kakeji, Y.

2026-05-28 oncology 10.64898/2026.05.26.26353629 medRxiv
Top 5%
0.2%

Background: CD8+ tumor-infiltrating lymphocytes (TILs) are established prognostic markers in colorectal cancer, yet the clinical significance of CD103+CD8+ tissue-resident memory-like (TRM-like) T cells in locally advanced rectal cancer (LARC) after neoadjuvant chemoradiotherapy (NACRT) remains unknown. Methods: We quantified CD8+ and CD103+CD8+ T-cell densities in stromal and intratumoral compartments of post-NACRT resection specimens from 40 LARC patients using Cu-Cyto, a deep learning-based imaging cytometry platform. Associations with survival, pathological response, and adjuvant chemotherapy (AC) were examined. Treatment-induced T-cell dynamics were assessed in paired pretreatment biopsies and post-NACRT resections (n = 9). Results: High stromal CD103+CD8+ density independently predicted better 5-year RFS (67.4% vs. 12.1%, p < 0.001) and OS (80.0% vs. 26.6%, p = 0.016); intratumoral density showed no prognostic significance. Pathological response correlated with stromal CD8+ but not CD103+CD8+ density. Paired analysis revealed a selective non-expansion of the CD103+ subset: stromal CD8+ T cells increased significantly after NACRT while CD103+CD8+ density remained unchanged. AC may preferentially benefit patients with low stromal CD103+CD8+ density. Conclusions: Stromal CD103+CD8+ T-cell density is a robust independent prognostic biomarker in rectal cancer after NACRT that appears to reflect pre-existing rather than treatment-induced immunity. Given its stability across NACRT, pretreatment biopsy assessment may provide equivalent prognostic information, with potential implications for patient stratification before treatment initiation.

15
A Foundational Exome Resource for Jordan: Dual Ancestry Admixture and Population-Specific Variants to Improve Clinical Variant Interpretation

Froukh, T.

2026-05-27 genetic and genomic medicine 10.64898/2026.05.23.26353895 medRxiv
Top 5%
0.1%

Currently, the genetic architecture of Middle Eastern populations is underrepresented in global genomic databases. This gap increases the rate of Variants of Uncertain Significance (VUSs) and clinical misinterpretations of genomic data in these populations. Whole exome sequencing was conducted on 90 healthy individuals from Jordan, and the data were analysed using Principal Component Analysis (PCA) and multi-computational filtering. PCA revealed a dual-ancestry (EUR-AFR) admixture rather than a triple admixture (EUR-AFR-AMR). More than 3,500 population-specific variants (PSVs) were identified, of which 72% were singletons. Additionally, 19 variants were significantly enriched compared to the maximum allele frequencies in public global databases (Fisher's exact test with Benjamini-Hochberg false discovery rate correction, p-value < 0.05). Consequently, the results suggest reclassifying the VUSs that reside in the ECE2 gene as likely benign and the variants of Conflicting Classification of Pathogenicity in the genes IL1RN and THPO as benign, based on the significant allele frequency (AF=0.0389, p-value < 0.05). Furthermore, a pathogenic ClinVar variant was identified in a healthy individual, warranting careful interpretation. The findings underscore the importance of identifying PSVs in order to minimize or even prevent clinical misdiagnosis and highlight the unique genetic signature in Jordan. The study serves as a foundational resource for precision medicine in the region.
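
As a rough illustration of the Benjamini-Hochberg false discovery rate correction named in this abstract, the sketch below adjusts a list of p-values in plain Python. The function name `benjamini_hochberg` and the example p-values are hypothetical and are not taken from the preprint; the Fisher's exact test p-values are assumed to have been computed beforehand.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, for illustration only.
example_p = [0.01, 0.04, 0.03, 0.005]
example_adjusted = benjamini_hochberg(example_p)
# A variant would count as enriched when its adjusted value is below 0.05.
significant = [q < 0.05 for q in example_adjusted]
```

Each adjusted value is the raw p-value scaled by the number of tests divided by its rank, then capped so that adjusted values never decrease as raw p-values increase.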

16
AI Decision Support for Challenging Teledermatology Cases: MedGemma Performance in the Dermatology ECHO Program

Appiagyei, J. B.; Otu, R. O.; Henry, M. K.; Casterline, B. W.; Becevic, M.

2026-05-26 health informatics 10.64898/2026.05.21.26353523 medRxiv
Top 5%
0.1%

Teledermatology expands access to dermatologic expertise in rural settings, yet diagnostic uncertainty persists in low-resource primary care. This retrospective study evaluated MedGemma-4B-IT, a compact multimodal vision-language model, as adjunctive clinical decision support for challenging diagnostic cases. We analyzed 77 zero-concordance cases (360 clinical photographs) from a Dermatology Extension for Community Healthcare Outcomes (ECHO) tele-mentoring program (2016-2021). Zero-concordance cases showed no overlap between the primary clinician's provisional diagnosis and the dermatologist-confirmed diagnosis. The model was prompted using a dermatologist-style format to generate ranked differential diagnoses. Performance was assessed using strict case-level top-k exact-match accuracy and relaxed matching criteria based on fuzzy string similarity. MedGemma achieved 0.0% strict top-1 accuracy, 1.3% top-3 accuracy, 3.9% top-5 accuracy, and 3.9% top-10 accuracy. Relaxed concept-level matching achieved 28.6% top-1, 63.6% top-5, and 67.5% top-10 accuracy. Image-level accuracy was 44.2% (159/360, 95% CI 39.0-49.5%). The model surfaced the correct diagnosis within differential lists in 45.5% of cases despite no exact top-1 matches, suggesting utility for differential expansion rather than definitive diagnosis. Performance varied across diagnostic categories, with highest accuracy in Other categories (54.5%) and lowest in neoplastic conditions (0.0%). Common errors included confusion between inflammatory and other diagnostic groupings. These findings characterize MedGemma performance on real-world teledermatology cases and inform safe, clinician-in-the-loop integration into teledermatology workflows where specialist oversight remains essential.
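
To make the difference between strict and relaxed top-k accuracy concrete, here is a minimal Python sketch. The abstract does not state which similarity measure or threshold was used, so the `difflib.SequenceMatcher` ratio and the 0.75 cut-off below are assumptions, and the function names and example diagnoses are hypothetical.

```python
from difflib import SequenceMatcher


def is_match(prediction, truth, threshold=1.0):
    """Exact match when threshold is 1.0, otherwise a fuzzy string match."""
    a, b = prediction.strip().lower(), truth.strip().lower()
    if threshold >= 1.0:
        return a == b
    # Assumed similarity measure; the preprint may use a different one.
    return SequenceMatcher(None, a, b).ratio() >= threshold


def top_k_accuracy(ranked_predictions, truths, k, threshold=1.0):
    """Fraction of cases whose truth matches any of the top-k predictions."""
    hits = 0
    for predictions, truth in zip(ranked_predictions, truths):
        if any(is_match(p, truth, threshold) for p in predictions[:k]):
            hits += 1
    return hits / len(truths)


# Hypothetical ranked differentials and confirmed diagnoses.
ranked = [
    ["eczema", "psoriasis", "tinea corporis"],
    ["atopic dermatitis", "contact dermatitis"],
]
confirmed = ["psoriasis", "allergic contact dermatitis"]

strict_top1 = top_k_accuracy(ranked, confirmed, k=1)
strict_top3 = top_k_accuracy(ranked, confirmed, k=3)
relaxed_top3 = top_k_accuracy(ranked, confirmed, k=3, threshold=0.75)
```

In this toy data the second case never matches exactly but does match under the relaxed criterion, which is the kind of gap between strict and relaxed scores the abstract reports.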

17
DKK1 and CKAP4 expression is associated with cervical lymph node metastasis in tongue squamous cell carcinoma

Fujita, H.; Takahashi, O.; Yada, N.; Tanaka, J.; Haraguchi, K.; Morioka, M.; Yaginuma, T.; Sasaguri, M.; Kokabu, S.; Habu, M.

2026-06-01 dentistry and oral medicine 10.64898/2026.05.29.26354440 medRxiv
Top 6%
0.1%

Objective: To identify Dickkopf-1 (DKK1) as a prognostically relevant candidate in head and neck squamous cell carcinoma and to evaluate whether DKK1 and cytoskeleton-associated protein 4 (CKAP4) expression is associated with cervical lymph node metastasis in tongue squamous cell carcinoma (TSCC). Methods: DKK1 was screened using the Human Protein Atlas Pathology Atlas. Immunohistochemical expression of DKK1 and CKAP4 was examined in 54 patients with primary TSCC (cT1-4N0) treated surgically between 2015 and 2020. Nine cases were excluded because of insufficient tissue blocks or inadequate staining quality, leaving 45 evaluable cases. Associations with delayed cervical lymph node metastasis were assessed together with conventional clinicopathological factors, including infiltrative growth pattern (INF) and pathological depth of invasion (pDOI). Results: In public database analysis, high DKK1 expression was associated with poorer overall survival in head and neck squamous cell carcinoma. In the TSCC cohort, pDOI ≥5 mm and INF pattern c were significantly associated with cervical lymph node metastasis. Positive DKK1 and CKAP4 expression were also significantly associated with cervical lymph node metastasis. Furthermore, combined DKK1/CKAP4 positivity, when incorporated with INF and pDOI, provided additional risk stratification, and cases with all 3 factors showed a markedly increased likelihood of cervical lymph node metastasis. Conclusions: Expression of DKK1 and CKAP4 was associated with cervical lymph node metastasis in TSCC. Combined assessment of DKK1/CKAP4 expression with INF and pDOI may improve pathological risk stratification and may help identify patients who require closer neck evaluation and postoperative management.

18
Cancer Prevalence and Patterns in Kilifi County: A 10-year Retrospective Descriptive Study

Masha, M.; Mbugua, R. W.; Abdullahi, M.; Sheikh, N. A.; Omar, A.; Abdihamid, O.

2026-06-01 oncology 10.64898/2026.05.20.26353643 medRxiv
Top 6%
0.1%

Background: Cancer is an increasing public health challenge in Kenya, particularly in rural and underserved regions where surveillance systems and diagnostic capacity remain limited. Kilifi County, located along the Kenyan coast, lacks a population-based cancer registry, and data on the local cancer burden are not available. This study aimed to characterize the demographic distribution of patients, cancer burden in the county, and management of cancer cases diagnosed at Kilifi County Referral Hospital (KCRH) over ten years. Methods: This retrospective study analyzed the patterns of cancer in Kilifi County using patient records from KCRH during the study period (January 1, 2014, to January 1, 2024). Results: A total of 101 patients with cancer were identified, 58% female, with a mean age of 54 years. Most patients were from Kilifi North (47%), with a high proportion reporting no formal occupation (41%) or farming (26%). Esophageal and cervical cancers were the most common (18% each), followed by breast and prostate cancers (5% each), with other malignancies occurring infrequently. Histopathology was the primary diagnostic modality (88%). Staging data were incomplete in 70% of cases; among documented cases, the majority presented with advanced disease (21% stage IV). Due to limited local treatment capacity, approximately half of the patients were referred to tertiary centers for chemotherapy, radiotherapy, or surgery. At data cut-off, 43% had died, 25% were on treatment, and 29% were lost to follow-up, with only 2% completing treatment or under follow-up. Conclusions: This study demonstrates a substantial cancer burden in Kilifi County and highlights critical gaps in diagnostic capacity, staging, and continuity of care. Strengthening cancer surveillance systems, expanding diagnostic and treatment infrastructure, and establishing a population-based cancer registry are essential to improving cancer outcomes and advancing equitable care in rural Kenya.

19
Compatibility of National Food Composition Databases with USDA FoodData Central: A Seven-Country LLM-Based Analysis

Nakagawa, S.; Yamamoto, A.

2026-06-01 nutrition 10.64898/2026.05.23.26353942 medRxiv
Top 6%
0.1%

To evaluate the international interoperability of food composition databases, we assessed the compatibility of seven national food composition tables with USDA FoodData Central (FDC) using the LLM-based matching method reported previously (Nakagawa and Yamamoto, 2026). Databases from four English-speaking countries (Canada, United Kingdom, Australia, and New Zealand), South Korea, and Japan were compared with 8,158 USDA FDC entries (SR Legacy and Foundation Foods, excluding Survey/FNDDS). Match rates varied by country (62.0-89.7%) and food category. After excluding six USDA categories unsuitable for cross-national comparison, 45.2% of the remaining 6,290 entries were not matched by any country. Canada showed the highest concordance, reflecting a shared North American food supply. Japan and South Korea showed similarly low coverage for vegetables and spices. These findings suggest that while USDA FDC represents a practical foundation for a globally comprehensive food composition database given its breadth, systematic incorporation of country-specific foods and classification schemes will be necessary to achieve true international interoperability.

20
Redefining Extent Of Resection After Meningioma Surgery: a Multicentre Observational Machine Learning Analysis Comparing Simpson, Radiological and Volumetric Grading

Pandit, A. S.; Deehan, M.; Moudgil-Joshi, J.; Reischer, G.; Mathew, S.; Pace, G.; Fatania, G.; Dalton, A.; Nair, R.; Hyare, H.; Mallon, D.; Kitchen, N.; Marcus, H. J.; Nachev, P.

2026-05-27 oncology 10.64898/2026.05.23.26353944 medRxiv
Top 6%
0.1%

Background: Extent of resection remains central to meningioma management, yet Simpson grading is subjective and may not reflect measurable postoperative residual disease. We compared surgeon-reported Simpson grade, report-derived radiological grading, and residual tumour volumetry across a multicentre cohort. Methods: We performed a retrospective study across two tertiary neurosciences centres comprising four hospitals, including patients undergoing primary cranial meningioma resection from 2006 to 2025. Postoperative magnetic resonance imaging (MRI) reports were harmonised using weakly supervised natural language processing based on term frequency-inverse document frequency (TF-IDF) and a linear support vector machine classifier. Residual tumour volume was segmented from contrast-enhanced postoperative MRI and log-transformed. Concordance between Simpson and radiological gross-total/subtotal resection classification was assessed using absolute agreement and prevalence-adjusted bias-adjusted kappa (PABAK). Cox models assessed recurrence-free survival, with bootstrap validation and anatomical and scan-timing sensitivity analyses. Results: Among 912 patients, recurrence or residual progression occurred in 281. Surgical-radiological agreement was substantial but imperfect (absolute agreement 74%; PABAK 0.61), with lower agreement in skull-base and parafalcine-parasagittal tumours. In adjusted models, recurrence hazard increased with Simpson grade (hazard ratio 1.54, 95% confidence interval 1.37-1.72), radiological grade (1.92, 1.68-2.20), and log-transformed residual volume (1.20, 1.16-1.24; all p<0.0005). Optimism-corrected concordance increased from Simpson grade to radiological grade and log-volumetry (0.692, 0.733, and 0.748), with this ranking preserved across sensitivity analyses. Conclusions: Imaging-based postoperative residual disease measures outperformed Simpson grade. TF-IDF-assisted report-derived grading provides a scalable bridge to volumetry, while quantitative residual volume offers the strongest prognostic representation.